Let’s call the help function on `library` and see what happens.
help(library)
You can add a new chunk of code by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.
Let’s take a quick look at how our notebook looks right now.
tidyverse - packages for data wranglers
tibble - tibbles as data frames
stringr - managing strings
tidyr - reshaping data:
- pivot_longer: pivot wide data into long format (i.e. “melt”)
- pivot_wider: pivot long data into wide format (i.e. “cast”)
- separate: separate (i.e. split) one column into multiple columns
- unite: unite (i.e. combine) multiple columns into one
Regressions:
Output:
So, as it turns out, you’ve already been working with a package within the Tidyverse!
tidyverse is a collection of packages written in a similar style, with a shared philosophy for approaching data structures. Many data scientists use it.
This book is an additional resource to help you splice, groupby, mutate and dice up your data.
You can install this with
#install.packages("tidyverse")
and then load it up with
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.4.2 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ dplyr 1.1.1
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.4 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Once you have this loaded, you have the following packages included:
For importing, it also installs packages (such as haven and readr) that cover other formats:
SAS: read_sas("mtcars.sas7bdat"), write_sas(mtcars, "mtcars.sas7bdat")
SPSS: read_sav("mtcars.sav"), write_sav(mtcars, "mtcars.sav")
Stata: read_dta("mtcars.dta"), write_dta(mtcars, "mtcars.dta")
jobs_2 = read_csv('job-automation-probability.csv')
jobs_2 = read_delim('job-automation-probability.csv', delim = ',')
write.csv(jobs, 'jobs2.csv')
A cheat sheet exists for A LOT of packages - you can take a look here
Let’s take a look at tibble. It is just a dataframe. Really not much to it.
The thing is, at first you will be annoyed: you will run into a lot of errors. Instead of R silently figuring things out for you (such as automatically converting a column to strings when it contains strings), tibble will raise an error.
This is good, because it forces you to deal with issues in your data now - instead of chasing a mysterious error from some package or function later on, with no idea why it’s happening.
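As a small, hypothetical illustration of that strictness (the column name below is made up):

```r
library(tibble)

df <- data.frame(countryname = c("US", "UK"))
tb <- as_tibble(df)

# A base data frame silently partial-matches column names with $:
df$country   # returns the countryname column anyway

# A tibble refuses to guess - it warns and returns NULL:
tb$country   # Warning: Unknown or uninitialised column: `country`
```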
Let’s import the dataset that we’ll use today - growth rate by country - using the R package WDI. Be sure to install it first with install.packages("WDI")
Honestly, tibble is pretty much a pandas data frame. It displays only the first 10 rows and takes your data as is so that you can work with it from the ground up.
#install.packages(c("tidyverse", "WDI"))
#notice that you can install several packages in one line
library("WDI")
wdi <- WDI(country = "all", start = 2000, end = 2015, extra = TRUE,
           indicator = c("NY.GDP.MKTP.KD.ZG"))
Search for indicators that contain the word ‘gdp’
gdp_indic <- WDIsearch('gdp')
wdi <- as_tibble(wdi)
#it's just that easy
A lot more about tibble can be found here
We can extract columns
In pandas, this would be df.columnname
#wdi$NY.GDP.MKTP.KD.ZG
Notice that RStudio shows you all of the columns after you input the dollar sign.
dplyr plays nice with tibble. Let’s rename a column:
wdi <- rename(wdi, gdp = 'NY.GDP.MKTP.KD.ZG')
So, now that you are using tibble, you’re going to need to know some basics of how to deal with strings.
It’s particularly useful when you want to search or filter within a string.
In Python, if you assign an item a string, you can split, filter, etc. Typically, in pandas, we will assign a column a string or tell pandas we want string operations on a column (like df['columnname'].str.replace(',','')).
This is not so dissimilar in R.
We use the function filter and apply the str_detect function to search for a particular word (in this example, “World”):
wdi %>% filter(str_detect(country, "World"))
You can use these options to help you match specific cases
. to match any character (except a newline)
^ to match the start of the string.
$ to match the end of the string.
So, if we want to have countries that start with “United”, we can do this as:
wdi %>% filter(str_detect(country, "^United"))
Similar to pandas, you’ll often need to clean a column of strings or other symbols (particularly if it’s something you want to be integers). You’ll use the str_replace or str_replace_all
str_replace_all(df$columnname, "things to replace", "replace it with something")
#Don't forget to assign it!
wdi$gdp <- str_replace_all(wdi$gdp, "[,$!]", "")
There is a whole chapter on this that you should read here
R follows a set of conventions that makes one layout of tabular data much easier to work with. We usually call this format “Long”, and it will always have:
So, generally, you’re trying to get your data into that point. Luckily, there are a few commands to help you with this.
There are lots of ways to do this, though: reshape(), melt() [from wide to long], cast() [from long to wide], gather()[from wide to long], spread()[from long to wide]
But, we’ll learn to transform data with pivot_longer and pivot_wider
Now, let’s try to pivot our data to the wide format.
When we pivot wider, it’s going to look something like this:
wdi_wide <- wdi %>% pivot_wider(names_from=year,values_from=gdp)
wdi_wide
We can transform the data into long format with the function pivot_longer.
It’ll look something like this:
Before I pivot, I’m going to check the names of my columns to see what columns I need to transform.
names(wdi_wide)
## [1] "iso2c" "country" "iso3c" "region" "capital" "longitude"
## [7] "latitude" "income" "lending" "2009" "2005" "2010"
## [13] "2006" "2007" "2012" "2011" "2000" "2014"
## [19] "2013" "2002" "2008" "2004" "2015" "2003"
## [25] "2001"
pivot_longer(wdi_wide, c('2009':'2001'), names_to='year', values_to='gdp')
Another handy function to learn in R is separate.
Let’s say you have dates that you want to split into multiple columns. You can use the function separate to create a distinct column for each part.
stocks = read.csv("all_stocks_5yr.csv")
date_sep <- stocks %>% separate(date, into = c("year", "month", "day"), sep = '-')
#the arrow denotes assignment - the direction of the arrow matters, but you can place it before or after the expression; both forms are equivalent. The code below works exactly the same as the code above.
stocks %>% separate(date, into = c("year", "month", "day"), sep = '-') -> date_sep
#equivalent to
date_sep <- stocks %>% separate(date, into = c("year", "month", "day"), sep = '-')
date_sep
separate’s sibling is unite - which combines two columns into one.
date_sep %>% unite(date, c("month", "day", "year"), sep = "/")
Now let’s get to running those regressions.
The general format is that you will specify the model as the function and inside that function you will define the regression model that you want to run.
Stata’s “reg” is R’s “lm” which stands for linear model and is at the core of regression analysis.
The model will look something like this:
lm(y ~ x1 + x2 + x3 + ..., data = df)
You can also call each of the columns within a dataframe like this:
lm(df$y ~ df$x1 + df$x2 + df$x3 + ...)
But, I prefer simplicity.
Let’s try running a basic OLS regression with our jobs dataset.
jobs = read.csv('/Users/mkaltenberg/Documents/GitHub/Data_Analysis_Python_R/Becoming a Viz Kid/job-automation-probability.csv')
ols <- lm(prob ~ average_ann_wage + numbEmployed , data = jobs)
Alright, so we got a regression!
We can view some of the results in the stored object on the left. Or let’s look into it with the function summary():
summary(ols)
##
## Call:
## lm(formula = prob ~ average_ann_wage + numbEmployed, data = jobs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72928 -0.24991 0.06363 0.24745 0.92316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.213e-01 2.622e-02 35.130 <2e-16 ***
## average_ann_wage -6.971e-06 4.031e-07 -17.293 <2e-16 ***
## numbEmployed 1.168e-08 2.761e-08 0.423 0.672
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3078 on 699 degrees of freedom
## Multiple R-squared: 0.3029, Adjusted R-squared: 0.3009
## F-statistic: 151.8 on 2 and 699 DF, p-value: < 2.2e-16
That’s better! Ok, so, we can see all of our general statistics here. We can also view specific parts by using the dollar sign to indicate which part of the output we want to view.
summary(ols)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.212632e-01 2.622403e-02 35.1304979 1.484084e-156
## average_ann_wage -6.970858e-06 4.030936e-07 -17.2933997 4.776142e-56
## numbEmployed 1.167886e-08 2.761272e-08 0.4229523 6.724601e-01
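A few other accessor functions pull out common pieces of a fitted lm object - a quick sketch using the ols model above:

```r
coef(ols)                  # named vector of point estimates
confint(ols)               # 95% confidence intervals for the coefficients
summary(ols)$r.squared     # the R-squared on its own
head(residuals(ols))       # first few residuals
```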
You can run the regression on a subset of the data using filter and grepl. NOTE carefully where each parenthesis closes.
The equivalent in Stata would be: reg prob average_ann_wage numbEmployed if education != "No formal educational credential"
ols2 <- lm(prob ~ average_ann_wage + numbEmployed , data = jobs %>% filter(!(grepl("No formal educational credential", education))))
summary(ols2)
##
## Call:
## lm(formula = prob ~ average_ann_wage + numbEmployed, data = jobs %>%
## filter(!(grepl("No formal educational credential", education))))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.69748 -0.27219 0.03687 0.27719 0.90587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.863e-01 3.016e-02 29.383 <2e-16 ***
## average_ann_wage -6.556e-06 4.447e-07 -14.743 <2e-16 ***
## numbEmployed -1.239e-08 4.216e-08 -0.294 0.769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3202 on 601 degrees of freedom
## Multiple R-squared: 0.2656, Adjusted R-squared: 0.2632
## F-statistic: 108.7 on 2 and 601 DF, p-value: < 2.2e-16
This is the same as the following format (a reminder of pipes and how we can use them):
jobs_noedu =
jobs %>%
filter(education != "No formal educational credential")
# filter(!(grepl("No formal educational credential", education))) ## This also works
ols2 <- lm(prob ~ average_ann_wage + numbEmployed , data = jobs_noedu)
summary(ols2)
##
## Call:
## lm(formula = prob ~ average_ann_wage + numbEmployed, data = jobs_noedu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.69748 -0.27219 0.03687 0.27719 0.90587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.863e-01 3.016e-02 29.383 <2e-16 ***
## average_ann_wage -6.556e-06 4.447e-07 -14.743 <2e-16 ***
## numbEmployed -1.239e-08 4.216e-08 -0.294 0.769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3202 on 601 degrees of freedom
## Multiple R-squared: 0.2656, Adjusted R-squared: 0.2632
## F-statistic: 108.7 on 2 and 601 DF, p-value: < 2.2e-16
Often (like REALLY REALLY often) we want to include robust standard errors. In stata, this is an option in the command reg.
White Standard Error Adjustment (Robust standard errors) in Stata: reg prob average_ann_wage numbEmployed, r
In R, it’s an entirely new function. So, let’s import the package estimatr.
library(estimatr)
ols1_robust = lm_robust(prob ~ average_ann_wage + numbEmployed , data = jobs)
summary(ols1_robust)
##
## Call:
## lm_robust(formula = prob ~ average_ann_wage + numbEmployed, data = jobs)
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower
## (Intercept) 9.213e-01 2.860e-02 32.2078 3.044e-140 8.651e-01
## average_ann_wage -6.971e-06 4.705e-07 -14.8151 2.182e-43 -7.895e-06
## numbEmployed 1.168e-08 2.046e-08 0.5708 5.683e-01 -2.849e-08
## CI Upper DF
## (Intercept) 9.774e-01 699
## average_ann_wage -6.047e-06 699
## numbEmployed 5.185e-08 699
##
## Multiple R-squared: 0.3029 , Adjusted R-squared: 0.3009
## F-statistic: 113.5 on 2 and 699 DF, p-value: < 2.2e-16
A better way to do this so that the model can be incorporated into stargazer is the sandwich package. estimatr for some reason doesn’t play well with stargazer.
library(sandwich)
library(lmtest) # to use coeftest
## Warning: package 'lmtest' was built under R version 4.0.5
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# vcovHC is the heteroskedasticity consistent estimation of the covariance matrix of the coefficient estimates
coeftest(ols, vcov = vcovHC(ols, type="HC0"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.2126e-01 2.8264e-02 32.5951 <2e-16 ***
## average_ann_wage -6.9709e-06 4.6296e-07 -15.0571 <2e-16 ***
## numbEmployed 1.1679e-08 2.0139e-08 0.5799 0.5622
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#This gives you the robust standard errors
Another thing you may want to do is include a dummy variable in your regression.
Generally, we call these factors. In Stata, you can include factors as i.dummy_variable.
R makes this pretty easy - it automatically knows that you are using a string variable and will create categorical variables (factors) out of it.
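By default, R picks the reference (omitted) category alphabetically. If you want a different base group, you can relevel the factor first - a sketch, assuming the category label below appears in the data:

```r
# make "High school diploma or equivalent" the omitted base category
jobs$education <- relevel(factor(jobs$education),
                          ref = "High school diploma or equivalent")

# the coefficients on the remaining categories are now relative to that group
ols_hs <- lm_robust(prob ~ average_ann_wage + numbEmployed + education,
                    data = jobs)
```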
ols_edu = lm_robust(prob ~ average_ann_wage + numbEmployed + factor(education), data = jobs)
summary(ols_edu)
##
## Call:
## lm_robust(formula = prob ~ average_ann_wage + numbEmployed +
## factor(education), data = jobs)
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 6.311e-01 5.882e-02
## average_ann_wage -2.834e-06 5.229e-07
## numbEmployed -5.931e-09 2.179e-08
## factor(education)Bachelor's degree -1.842e-01 5.631e-02
## factor(education)Doctoral or professional degree -1.834e-01 7.130e-02
## factor(education)High school diploma or equivalent 1.998e-01 5.262e-02
## factor(education)Master's degree -2.956e-01 6.292e-02
## factor(education)No formal educational credential 2.471e-01 5.597e-02
## factor(education)Postsecondary nondegree award -1.187e-02 6.813e-02
## factor(education)Some college, no degree 1.619e-01 1.297e-01
## t value Pr(>|t|) CI Lower
## (Intercept) 10.7300 5.952e-25 5.156e-01
## average_ann_wage -5.4190 8.281e-08 -3.860e-06
## numbEmployed -0.2722 7.856e-01 -4.872e-08
## factor(education)Bachelor's degree -3.2710 1.125e-03 -2.947e-01
## factor(education)Doctoral or professional degree -2.5728 1.030e-02 -3.234e-01
## factor(education)High school diploma or equivalent 3.7970 1.593e-04 9.649e-02
## factor(education)Master's degree -4.6982 3.167e-06 -4.192e-01
## factor(education)No formal educational credential 4.4160 1.166e-05 1.373e-01
## factor(education)Postsecondary nondegree award -0.1743 8.617e-01 -1.456e-01
## factor(education)Some college, no degree 1.2484 2.123e-01 -9.274e-02
## CI Upper DF
## (Intercept) 7.466e-01 692
## average_ann_wage -1.807e-06 692
## numbEmployed 3.685e-08 692
## factor(education)Bachelor's degree -7.363e-02 692
## factor(education)Doctoral or professional degree -4.344e-02 692
## factor(education)High school diploma or equivalent 3.031e-01 692
## factor(education)Master's degree -1.721e-01 692
## factor(education)No formal educational credential 3.570e-01 692
## factor(education)Postsecondary nondegree award 1.219e-01 692
## factor(education)Some college, no degree 4.166e-01 692
##
## Multiple R-squared: 0.4489 , Adjusted R-squared: 0.4417
## F-statistic: 93.9 on 9 and 692 DF, p-value: < 2.2e-16
There is a convenient syntax for including interaction terms:
- x1:x2 “crosses” the variables (equivalent to including only the x1 × x2 interaction term)
- x1/x2 “nests” the second variable within the first (equivalent to x1 + x1:x2)
- x1*x2 includes all parent and interaction terms (equivalent to x1 + x2 + x1:x2)
Almost always you’ll use * (all parent and interaction terms). There are situations in which you wouldn’t, but they’re rare.
So, let’s check it out:
int = lm_robust(prob ~ average_ann_wage + numbEmployed*education, data = jobs)
summary(int)
##
## Call:
## lm_robust(formula = prob ~ average_ann_wage + numbEmployed *
## education, data = jobs)
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 6.664e-01 7.856e-02
## average_ann_wage -2.847e-06 5.350e-07
## numbEmployed -5.138e-07 8.928e-07
## educationBachelor's degree -2.247e-01 7.686e-02
## educationDoctoral or professional degree -2.147e-01 9.046e-02
## educationHigh school diploma or equivalent 1.742e-01 7.376e-02
## educationMaster's degree -2.848e-01 8.770e-02
## educationNo formal educational credential 2.022e-01 7.617e-02
## educationPostsecondary nondegree award -4.516e-02 9.087e-02
## educationSome college, no degree 8.691e-03 1.601e-01
## numbEmployed:educationBachelor's degree 5.455e-07 8.975e-07
## numbEmployed:educationDoctoral or professional degree 4.856e-07 9.151e-07
## numbEmployed:educationHigh school diploma or equivalent 4.514e-07 8.946e-07
## numbEmployed:educationMaster's degree -1.708e-07 9.778e-07
## numbEmployed:educationNo formal educational credential 5.334e-07 8.929e-07
## numbEmployed:educationPostsecondary nondegree award 4.995e-07 9.250e-07
## numbEmployed:educationSome college, no degree 6.669e-07 9.110e-07
## t value Pr(>|t|)
## (Intercept) 8.48369 1.343e-16
## average_ann_wage -5.32124 1.398e-07
## numbEmployed -0.57550 5.651e-01
## educationBachelor's degree -2.92354 3.575e-03
## educationDoctoral or professional degree -2.37356 1.789e-02
## educationHigh school diploma or equivalent 2.36178 1.847e-02
## educationMaster's degree -3.24796 1.219e-03
## educationNo formal educational credential 2.65441 8.129e-03
## educationPostsecondary nondegree award -0.49705 6.193e-01
## educationSome college, no degree 0.05428 9.567e-01
## numbEmployed:educationBachelor's degree 0.60777 5.435e-01
## numbEmployed:educationDoctoral or professional degree 0.53070 5.958e-01
## numbEmployed:educationHigh school diploma or equivalent 0.50455 6.140e-01
## numbEmployed:educationMaster's degree -0.17471 8.614e-01
## numbEmployed:educationNo formal educational credential 0.59745 5.504e-01
## numbEmployed:educationPostsecondary nondegree award 0.53996 5.894e-01
## numbEmployed:educationSome college, no degree 0.73201 4.644e-01
## CI Lower CI Upper
## (Intercept) 5.122e-01 8.207e-01
## average_ann_wage -3.898e-06 -1.797e-06
## numbEmployed -2.267e-06 1.239e-06
## educationBachelor's degree -3.756e-01 -7.380e-02
## educationDoctoral or professional degree -3.923e-01 -3.710e-02
## educationHigh school diploma or equivalent 2.938e-02 3.190e-01
## educationMaster's degree -4.570e-01 -1.126e-01
## educationNo formal educational credential 5.263e-02 3.518e-01
## educationPostsecondary nondegree award -2.236e-01 1.332e-01
## educationSome college, no degree -3.057e-01 3.231e-01
## numbEmployed:educationBachelor's degree -1.217e-06 2.308e-06
## numbEmployed:educationDoctoral or professional degree -1.311e-06 2.282e-06
## numbEmployed:educationHigh school diploma or equivalent -1.305e-06 2.208e-06
## numbEmployed:educationMaster's degree -2.091e-06 1.749e-06
## numbEmployed:educationNo formal educational credential -1.220e-06 2.286e-06
## numbEmployed:educationPostsecondary nondegree award -1.317e-06 2.316e-06
## numbEmployed:educationSome college, no degree -1.122e-06 2.456e-06
## DF
## (Intercept) 685
## average_ann_wage 685
## numbEmployed 685
## educationBachelor's degree 685
## educationDoctoral or professional degree 685
## educationHigh school diploma or equivalent 685
## educationMaster's degree 685
## educationNo formal educational credential 685
## educationPostsecondary nondegree award 685
## educationSome college, no degree 685
## numbEmployed:educationBachelor's degree 685
## numbEmployed:educationDoctoral or professional degree 685
## numbEmployed:educationHigh school diploma or equivalent 685
## numbEmployed:educationMaster's degree 685
## numbEmployed:educationNo formal educational credential 685
## numbEmployed:educationPostsecondary nondegree award 685
## numbEmployed:educationSome college, no degree 685
##
## Multiple R-squared: 0.4527 , Adjusted R-squared: 0.4399
## F-statistic: 56.23 on 16 and 685 DF, p-value: < 2.2e-16
There are a LOT of packages out there to export and display various statistical results. You can take a look at them here
There are very few that export directly to Word (flextable and huxtable) - rather, you would export HTML output and then copy and paste it into Word or a presentation, or export a PDF from R Markdown for a presentation. The reason so few packages export to Word is that it’s a nightmare of bugs. Also, most researchers using R/Python use LaTeX for formatting, so most packages export to LaTeX quite easily.
We will focus on two of them that make output in a variety of formats easy and pretty - stargazer and modelsummary.
Most people who use R for data analysis use stargazer as a way to present regression tables and summary statistics. It is by far the most popular package for this - possibly because it’s so easy to use and the output looks very nice.
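A table like the one below can be produced by a call along these lines - a sketch, since the exact model objects and options may differ (note that stargazer wants plain lm objects, not lm_robust ones):

```r
library(stargazer)

# two hypothetical specifications on the jobs data
m1 <- lm(prob ~ average_ann_wage + numbEmployed, data = jobs)
m2 <- lm(prob ~ education + numbEmployed + median_ann_wage, data = jobs)

# type = "text" prints a plain-text table to the console
stargazer(m1, m2, type = "text", title = "My Data Models")
```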
##
## My Data Models
## ===========================================================================================
## Dependent variable:
## ------------------------------------------------
## prob
## (1) (2)
## -------------------------------------------------------------------------------------------
## average_ann_wage -0.00001***
## (0.00000)
##
## educationBachelor's degree -0.202***
## (0.049)
##
## educationDoctoral or professional degree -0.216***
## (0.080)
##
## educationHigh school diploma or equivalent 0.201***
## (0.045)
##
## educationMaster's degree -0.306***
## (0.067)
##
## educationNo formal educational credential 0.250***
## (0.053)
##
## educationPostsecondary nondegree award -0.013
## (0.060)
##
## educationSome college, no degree 0.142
## (0.146)
##
## numbEmployed 0.000 -0.000
## (0.00000) (0.00000)
##
## median_ann_wage -0.00000***
## (0.00000)
##
## Constant 0.921*** 0.613***
## (0.026) (0.052)
##
## -------------------------------------------------------------------------------------------
## Observations 702 702
## R2 0.303 0.444
## Adjusted R2 0.301 0.436
## Residual Std. Error 0.308 (df = 699) 0.276 (df = 692)
## F Statistic 151.845*** (df = 2; 699) 61.304*** (df = 9; 692)
## ===========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We can export the table into HTML with the option “out” - let’s check out what the file looks like after we export it.
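Hypothetically, the call might look like this (the models and the file name are assumptions for illustration):

```r
library(stargazer)

m1 <- lm(prob ~ average_ann_wage + numbEmployed, data = jobs)
m2 <- lm(prob ~ education + numbEmployed + median_ann_wage, data = jobs)

# "out" writes the rendered table to a file on disk as well
stargazer(m1, m2, type = "html", title = "My models", out = "my_models.html")
```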
##
## <table style="text-align:center"><caption><strong>My models</strong></caption>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="2"><em>Dependent variable:</em></td></tr>
## <tr><td></td><td colspan="2" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td colspan="2">prob</td></tr>
## <tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">average_ann_wage</td><td>-0.00001<sup>***</sup></td><td></td></tr>
## <tr><td style="text-align:left"></td><td>(0.00000)</td><td></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">educationBachelor's degree</td><td></td><td>-0.202<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.049)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">educationDoctoral or professional degree</td><td></td><td>-0.216<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.080)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">educationHigh school diploma or equivalent</td><td></td><td>0.201<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.045)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">educationMaster's degree</td><td></td><td>-0.306<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.067)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">educationNo formal educational credential</td><td></td><td>0.250<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.053)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">educationPostsecondary nondegree award</td><td></td><td>-0.013</td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.060)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">educationSome college, no degree</td><td></td><td>0.142</td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.146)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">numbEmployed</td><td>0.000</td><td>-0.000</td></tr>
## <tr><td style="text-align:left"></td><td>(0.00000)</td><td>(0.00000)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">median_ann_wage</td><td></td><td>-0.00000<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.00000)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Constant</td><td>0.921<sup>***</sup></td><td>0.613<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.026)</td><td>(0.052)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>702</td><td>702</td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.303</td><td>0.444</td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.301</td><td>0.436</td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>0.308 (df = 699)</td><td>0.276 (df = 692)</td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>151.845<sup>***</sup> (df = 2; 699)</td><td>61.304<sup>***</sup> (df = 9; 692)</td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="2" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>
We can easily see the regression table when we directly input the code in R Markdown.
There are many options within stargazer that we can play around with to get our tables “just right” - and you will spend a lot of time doing this. Every researcher spends more time than they’d like getting their tables correct.
A very useful cheat sheet on stargazer can be found here
##
## <table style="text-align:center"><caption><strong>OLS Results</strong></caption>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="2"><em>Dependent variable:</em></td></tr>
## <tr><td></td><td colspan="2" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td colspan="2">Prob. of Automated Job</td></tr>
## <tr><td style="text-align:left"></td><td>OLS</td><td>OLS</td></tr>
## <tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Avg. Salary</td><td>-0.00001<sup>***</sup></td><td></td></tr>
## <tr><td style="text-align:left"></td><td>(0.00000)</td><td></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">BA</td><td></td><td>-0.202<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.049)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">PhD+</td><td></td><td>-0.216<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.080)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">HS</td><td></td><td>0.201<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.045)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">MA</td><td></td><td>-0.306<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.067)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">None</td><td></td><td>0.250<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.053)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Postsecond</td><td></td><td>-0.013</td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.060)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Some College</td><td></td><td>0.142</td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.146)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">N emp</td><td>0.000</td><td>-0.000</td></tr>
## <tr><td style="text-align:left"></td><td>(0.00000)</td><td>(0.00000)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Med Salary</td><td></td><td>-0.00000<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.00000)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Constant</td><td>0.921<sup>***</sup></td><td>0.613<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.026)</td><td>(0.052)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>702</td><td>702</td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.303</td><td>0.444</td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.301</td><td>0.436</td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>0.308 (df = 699)</td><td>0.276 (df = 692)</td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>151.845<sup>***</sup> (df = 2; 699)</td><td>61.304<sup>***</sup> (df = 9; 692)</td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="2" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>
### Robust standard errors and stargazer
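One common pattern is to compute robust standard errors with sandwich and hand them to stargazer via its se argument - a sketch, assuming ols is the model fit earlier:

```r
library(sandwich)
library(stargazer)

ols <- lm(prob ~ average_ann_wage + numbEmployed, data = jobs)

# White/robust standard errors from the HC0 covariance matrix
robust_se <- sqrt(diag(vcovHC(ols, type = "HC0")))

# column (1): classical SEs (NULL keeps the default);
# column (2): the same model with the robust SEs swapped in
stargazer(ols, ols, se = list(NULL, robust_se),
          column.labels = c("OLS", "OLS robust"),
          type = "html", title = "OLS Results")
```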
##
## <table style="text-align:center"><caption><strong>OLS Results</strong></caption>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="2"><em>Dependent variable:</em></td></tr>
## <tr><td></td><td colspan="2" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td>Prob. of Automated Job</td><td></td></tr>
## <tr><td style="text-align:left"></td><td>OLS</td><td>OLS robust</td></tr>
## <tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Avg. Salary</td><td>-0.00001<sup>***</sup></td><td>-0.00001<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.00000)</td><td>(0.00000)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">BA</td><td>0.000</td><td>0.000</td></tr>
## <tr><td style="text-align:left"></td><td>(0.00000)</td><td>(0.00000)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">PhD+</td><td>0.921<sup>***</sup></td><td>0.921<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.026)</td><td>(0.028)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>702</td><td></td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.303</td><td></td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.301</td><td></td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>0.308 (df = 699)</td><td></td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>151.845<sup>***</sup> (df = 2; 699)</td><td></td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="2" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>
# vtable
For quick summary statistics you can use vtable.
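A summary table like the one below can be generated with vtable’s st() function - a minimal sketch:

```r
library(vtable)

# st() prints a summary-statistics table (N, mean, sd, min, percentiles, max)
st(jobs)
```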
## Loading required package: kableExtra
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| X_…rank | 702 | 351.5 | 202.794 | 1 | 176.25 | 526.75 | 702 |
| prob | 702 | 0.536 | 0.368 | 0.003 | 0.11 | 0.89 | 0.99 |
| Average.annual.wage | 702 | 55645.278 | 28985.949 | 20460 | 35757.5 | 66477.5 | 232870 |
| len | 702 | 33.479 | 17.486 | 6 | 19 | 44 | 92 |
| probability | 702 | 0.536 | 0.368 | 0.003 | 0.11 | 0.89 | 0.99 |
| numbEmployed | 702 | 186835.726 | 423140.174 | 370 | 17307.5 | 151702.5 | 4528550 |
| median_ann_wage | 702 | 51345.468 | 26912.979 | 19290 | 33355 | 61652.5 | 232870 |
| average_ann_wage | 702 | 55645.278 | 28985.949 | 20460 | 35757.5 | 66477.5 | 232870 |
There are also other options that exist with datasummary that you can check out here